FastAI Lecture 03

Notes

FastAI

History

Neural Network

Theory

Author

Agastya Patel

Published

January 13, 2024

Modified

January 13, 2024

Rectified Linear Unit (ReLU): y = max(0, mx + b)
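To make the idea concrete, here is a minimal sketch of a rectified linear function in plain Python. It evaluates the line m*x + b and then clips negative results to zero. The name `rectified_linear` and the sample values are chosen for illustration only.

```python
def rectified_linear(m, b, x):
    """Evaluate the line m*x + b, then replace negative results with zero."""
    y = m * x + b
    return max(0.0, y)


# Illustrative values: slope 2, intercept 1
for x in (-2.0, 0.0, 1.5):
    print(x, rectified_linear(2.0, 1.0, x))
```

Wherever the underlying line is negative the output is flat at zero, and everywhere else it follows the line unchanged.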

Calculating Loss

What are derivatives?

A derivative gives the rate of change of a function at a particular value of its parameter.

> In machine learning, the key is knowing how to change the parameters (weights) of a function to reduce the loss. Derivatives help because they tell us how the loss will change when we alter a weight. Calculus gives us these derivatives, from which we can compute the gradients of the function. - fastbook
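One way to build intuition for "rate of change at a point" is to estimate it numerically. The sketch below nudges the input slightly in both directions and measures how much the output moves. The helper `numerical_derivative` and the step size `h` are assumptions made for this illustration, not part of any library.

```python
def f(x):
    return x ** 2


def numerical_derivative(fn, x, h=1e-5):
    """Estimate the slope of fn at x with a central difference."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


slope = numerical_derivative(f, 3.0)
print(slope)  # very close to 6, the exact derivative 2*x at x = 3
```

In practice, deep learning libraries compute derivatives exactly with automatic differentiation rather than with finite differences like this.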

Calculating derivatives for weights in NN

For neural networks with lots of weights, we find the derivative with respect to each weight, treating all the others as constants. In deep learning, "gradients" usually means the values of a function's derivatives at particular parameter values. Calling requires_grad_() on a tensor tells PyTorch to track the operations applied to it so that these derivatives can be calculated automatically.

from torch import tensor

def f(x): return x**2

xt = tensor(3.).requires_grad_()

## Calculating function with the value 
yt = f(xt)
yt
>>tensor(9., grad_fn=<PowBackward0>)

## Asking pytorch to calculate gradient for us
yt.backward()
# The "backward" here refers to _backpropagation_, which is the name given to the process of calculating the derivative of each layer.

xt.grad
>> tensor(6.)

The derivative of f(x) = x^2 is 2*x, which is 6 at x = 3. This matches the value PyTorch stored in xt.grad (the gradient).
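The same mechanism extends to several weights at once. The sketch below uses a made-up loss, w0**2 + 3*w1, chosen only so the partial derivatives are easy to check by hand: 2*w0 with respect to the first weight and 3 with respect to the second.

```python
import torch

# Two weights in one tensor; requires_grad=True asks PyTorch to track them.
w = torch.tensor([2.0, 5.0], requires_grad=True)

# Hypothetical loss for illustration: w0**2 + 3*w1
loss = w[0] ** 2 + 3 * w[1]

# Backpropagation fills w.grad with one partial derivative per weight.
loss.backward()

print(w.grad)  # partial derivatives: 2*w0 = 4 and 3
```

Each entry of `w.grad` is the derivative with respect to one weight, with the other weight held constant.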

The gradients only tell us the slope of our function, they don’t actually tell us exactly how far to adjust the parameters. But it gives us some idea of how far; if the slope is very large, then that may suggest that we have more adjustments to do, whereas if the slope is very small, that may suggest that we are close to the optimal value. - fastbook
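Since the gradient gives only the slope, something else has to decide how far to move. A common choice is to multiply the gradient by a small number, the learning rate. The sketch below applies that rule to f(x) = x^2 in plain Python. The learning rate of 0.1 and the 20 steps are arbitrary values picked for illustration.

```python
def grad_f(x):
    """Derivative of f(x) = x**2."""
    return 2 * x


lr = 0.1   # learning rate (illustrative value)
x = 3.0    # starting point

for step in range(20):
    # Move against the slope; a large slope gives a large step.
    x -= lr * grad_f(x)

print(x)  # much closer to the minimum at 0 than the starting value
```

The steps shrink automatically as x approaches the minimum, because the slope itself gets smaller there.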

Loss vs Metric

Aspect	Metric	Loss
Purpose Difference	Drives human understanding of performance	Drives automated learning by optimization
Smoothness Requirement	Not constrained by smoothness	Requires smoothness for meaningful derivative
Optimization vs. Real Goal	Reflects actual goals	Compromise between real goals and optimization
Calculation Process	Provides overall model evaluation	Calculated per item, averaged at epoch end
Focus Consideration	Primary focus for judging performance	Important for automated learning, may not directly represent end goal
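The smoothness row can be illustrated with a small sketch. Below, a tiny change in the predictions leaves accuracy (a metric) exactly the same, while an L1-style loss moves slightly. The function names, the 0.5 threshold, and the sample predictions are assumptions made for this example.

```python
def accuracy(preds, targets, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    hits = [(p > threshold) == bool(t) for p, t in zip(preds, targets)]
    return sum(hits) / len(hits)


def l1_loss(preds, targets):
    """Mean absolute distance between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


targets = [1, 0, 1]
before = [0.60, 0.40, 0.90]
after = [0.61, 0.39, 0.91]   # each prediction nudged slightly toward its target

print(accuracy(before, targets), accuracy(after, targets))  # unchanged
print(l1_loss(before, targets), l1_loss(after, targets))    # slightly lower after
```

Because accuracy does not react to the small improvement, its gradient is zero almost everywhere and gives the optimizer nothing to follow. The loss does react, which is what makes it usable for learning.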

Why Batches?

Once the loss has been calculated, when should the system update the weights? If the loss is calculated for a single item, it carries little information and produces an imprecise, unstable gradient. If the loss is calculated for the entire dataset, each update takes a very long time.

Mini Batch

So we compromise and calculate the average loss over a few data items at a time, called a mini-batch. The batch size is the number of items in each mini-batch.

Batch Size	Gradient Quality	Time per Mini-Batch	Mini-Batches per Epoch
Larger	More accurate and stable estimate of your dataset's gradients from the loss function	Longer time to process	Fewer mini-batches processed per epoch

NOTE: The batch size cannot be made arbitrarily large, because each mini-batch has to fit in GPU memory.
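The link between batch size and the number of mini-batches per epoch is simple arithmetic, sketched below. The helper name `batches_per_epoch` is hypothetical, and the sketch assumes the last, smaller batch is kept rather than dropped.

```python
import math


def batches_per_epoch(n_items, batch_size):
    """Number of mini-batches needed to see every item once."""
    return math.ceil(n_items / batch_size)


n_items = 26
for bs in (1, 6, 13, 26):
    print(bs, batches_per_epoch(n_items, bs))
```

With 26 items, a batch size of 6 gives 5 mini-batches (four full ones and one with the 2 leftover items), while a batch size of 26 gives a single weight update per epoch.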

Randomization with mini batches

A Dataset is a collection of (input, label) tuples. In both PyTorch and fastai it is passed to a DataLoader, which can shuffle the items and group them into random mini-batches.

import string
from fastai.data.all import *  # provides L and DataLoader

ds = L(enumerate(string.ascii_lowercase))
ds
>> (#26) [(0, 'a'),(1, 'b'),(2, 'c'),(3, 'd'),(4, 'e'),(5, 'f'),(6, 'g'),(7, 'h'),(8, 'i'),(9, 'j')...]

dl = DataLoader(ds, batch_size=6, shuffle=True)
list(dl)
>> [(tensor([17, 18, 10, 22,  8, 14]), ('r', 's', 'k', 'w', 'i', 'o')),
 (tensor([20, 15,  9, 13, 21, 12]), ('u', 'p', 'j', 'n', 'v', 'm')),
 (tensor([ 7, 25,  6,  5, 11, 23]), ('h', 'z', 'g', 'f', 'l', 'x')),
 (tensor([ 1,  3,  0, 24, 19, 16]), ('b', 'd', 'a', 'y', 't', 'q')),
 (tensor([2, 4]), ('c', 'e'))]
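To see roughly what shuffling and batching involve, here is a plain-Python sketch that shuffles a list of (index, letter) tuples and slices it into mini-batches. It is not how DataLoader is implemented, and unlike DataLoader it does not collate each batch into tensors. The function `random_minibatches` and its `seed` parameter are invented for this illustration.

```python
import random
import string


def random_minibatches(items, batch_size, seed=None):
    """Shuffle a copy of items, then slice it into consecutive batches."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]


ds = list(enumerate(string.ascii_lowercase))
batches = random_minibatches(ds, batch_size=6, seed=0)

print([len(b) for b in batches])  # four batches of 6 and one of 2
```

Every item appears in exactly one batch per epoch, and a fresh shuffle each epoch changes which items end up together.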

Term	Meaning
ReLU	Function that returns 0 for negative numbers and doesn’t change positive numbers.
Mini-batch	A small group of inputs and labels gathered together in two arrays. One gradient descent step is taken per batch (rather than one per whole epoch).
Forward pass	Applying the model to some input and computing the predictions.
Loss	A value that represents how well (or badly) our model is doing.
Gradient	The derivative of the loss with respect to some parameter of the model.
Backward pass	Computing the gradients of the loss with respect to all model parameters.
Gradient descent	Taking a step in the directions opposite to the gradients to make the model parameters a little bit better.
Learning rate	The size of the step we take when applying SGD to update the parameters of the model.
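The terms above fit together into one loop. The sketch below is a deliberately tiny, plain-Python version with a single weight, toy data generated from y = 2x, and a gradient worked out by hand instead of by PyTorch. The data, learning rate, batch size, and number of epochs are all made-up values for illustration.

```python
# Toy data: the "right" weight is 2
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

w = 0.0          # the single model parameter
lr = 0.05        # learning rate
batch_size = 2   # items per mini-batch

for epoch in range(20):
    for i in range(0, len(xs), batch_size):
        xb, yb = xs[i:i + batch_size], ys[i:i + batch_size]   # mini-batch

        preds = [w * x for x in xb]                           # forward pass
        loss = sum((p - y) ** 2 for p, y in zip(preds, yb)) / len(xb)

        # Backward pass done by hand: d(loss)/dw for mean squared error
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, yb, xb)) / len(xb)

        w -= lr * grad                                        # gradient descent step

print(w)  # approaches 2 as training proceeds
```

With PyTorch, the hand-written `grad` line would be replaced by a call to `loss.backward()` on tensors that track gradients.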